Nature Computational Science
Springer Science and Business Media LLC
Preprints posted in the last 30 days, ranked by how well they match Nature Computational Science's content profile, based on 50 papers previously published here. The average preprint has a 0.05% match score for this journal, so anything above that is already an above-average fit.
Xie, R.; Zhukova, A.; Pena, P. G.; Iglesias, G.; Hu, S.; Wang, J.; Tsang, T. K.; Dhanasekaran, V.; Kraemer, M. U. G.; Pybus, O. G.; Gascuel, O.
Infectious disease dynamics can be inferred from pathogen genomic data using phylodynamic methods, but the applicability of many such approaches to large data sets is constrained by computational cost. Recent deep-learning approaches to phylodynamics have improved scalability, yet challenges remain when genetic divergence is limited during fast-spreading outbreaks. To address this, we use pathogen-specific models to show that deep-learning models trained on outbreak-like phylogenies can accurately estimate the reproductive number (R) when both the birth-death model and the expected phylogenetic resolution are matched to the target pathogen, highlighting the importance of realistic training conditions. Focusing on three major respiratory pathogens of public health importance (SARS-CoV-2, seasonal human influenza virus, and respiratory syncytial virus (RSV)), we introduce PhyloRt, a scalable framework for estimating the time-varying reproductive number (Rt) from large outbreak phylogenies. PhyloRt decomposes large trees into overlapping subtrees and applies a hierarchical deep-learning-based inference strategy to classify subtrees as exhibiting constant or time-varying reproduction numbers, enabling identifiable and computationally efficient estimation of Rt as a piecewise-constant trajectory through time. Applications to SARS-CoV-2 and influenza outbreaks show that PhyloRt recovers transmission dynamics consistent with estimates derived from mathematical epidemiological and Bayesian phylodynamic analyses. Our work enables scalable and rapid estimation of time-varying transmission dynamics from very large-scale outbreak genomic data sets, supporting real-time genomic epidemiology of emerging pathogens.
Significance: Estimating changes in transmission dynamics over time is important for responding to infectious disease outbreaks. Current methods mostly rely on reported case data from epidemiological surveillance, which can be biased or incomplete due to variable testing capabilities, particularly in resource-limited settings. A complementary approach is to use viral genomes as an alternative data source. However, inferences from genomic data can be computationally intensive and have mainly been applied retrospectively. We present PhyloRt, a scalable deep-learning-based phylodynamic framework that enables fast inference of the time-varying reproductive number (Rt) from large outbreak phylogenies. Our approach is widely applicable and provides a practical way to monitor epidemic dynamics, complementing traditional surveillance and supporting timely public health decision-making.
Fielding, J. J.; Wu, S.; Melton, H. J.; Wang, C. Z.; Fisk, N.; du Plessis, L.; Hoehn, K. B.
Phylogenetic methods for cell lineage tracing have driven significant insights into organismal development, immune responses, and tumor evolution. While most methods estimate mutation trees, time-resolved lineage trees are more interpretable and could relate events like cellular migration and differentiation to perturbations like vaccines and drug treatments. However, somatic mutation rates vary dramatically by cell type, significantly biasing existing methods. We introduce TyCHE (Type-linked Clocks for Heterogeneous Evolution), a Bayesian phylogenetics package that infers time-resolved phylogenies of populations with distinct evolutionary rates. We demonstrate that TyCHE improves tree accuracy using a new simulation package SimBLE (Simulator of B cell Lineage Evolution). We use TyCHE to infer patterns of memory B cell differentiation during HIV infection, dynamics of recall germinal centers following influenza vaccination, evolution of a glioma tumor lineage, and progression of a bacterial lung infection. TyCHE and SimBLE are available as open-source software packages compatible with the BEAST2 and Immcantation ecosystems.
Taschler, B.; Nichols, T. E.; Ganjgahi, H.
Normative models produce per-subject deviation scores that feed directly into downstream analyses, but typical pipelines (i) treat confounders with ad-hoc or purely linear adjustments, and (ii) pass point estimates of deviation scores directly to the downstream model, ignoring uncertainty. We propose an integrated, two-module Bayesian framework that aims to address both limitations. A normative module based on Bayesian Additive Regression Trees (BART) flexibly captures non-linear effects and higher-order interactions while marginalising over image-quality variables via counterfactual averaging. Crucially, we define individual deviation as $d_i = E[Y \mid X_i, Z_i] - \mu(Z_i)$, with $\mu(Z)$ the feature-conditional population mean, not as a residual. A SoftBART survival model then ingests the full posterior distribution of deviation scores via a cut-posterior construction, propagating upstream uncertainty while blocking feedback from the outcome model. Across challenging simulations and a large clinical data set of multiple sclerosis patients (N>8k), the integrated approach yields better calibration, prediction accuracy and time-varying hazard separation between groups than a two-step plug-in Cox regression model. Modularised inference with BART-based normative deviations improves both flexibility and uncertainty quantification, and extends naturally to other outcomes beyond survival.
Tellaetxe-Elorriaga, I.; Jimenez-Marin, A.; Diez, I.; Erramuzpe, A.; Cortes, J. M.
The preclinical phase of Alzheimer's disease (AD) is characterized by profound biological and structural heterogeneity, challenging our ability to map early pathology onto large-scale brain networks. To address this fundamental challenge, we introduce Functional Deviation Maps (πz), an individualized neuroimaging framework for mapping participant-specific functional architecture to their unique structural atrophy landscape. By fitting a normative model to the voxel-based morphometry of amyloid-negative individuals, we extract personalized "atrophy seeds" (W-scores ≤ -1.96) for amyloid-positive patients, subsequently obtaining their resting-state seed-based connectivity (SBC). By standardizing these participant-level SBC maps against a healthy reference distribution, we show that, despite the highly variable spatial origins of structural atrophy, individual functional deviations converge into a common "atrophy network". Spatial enrichment analyses show that the functional disruption is not random, but is preferentially dominated by the Default Mode Network. Furthermore, by projecting these population-level functional deviations onto high-order cognitive topographies, we find a considerable alignment with the brain's fundamental unimodal-transmodal and external-internal attentional gradients. Overall, the πz framework transcends conventional group-level averages, offering a highly personalized, biologically meaningful signature of system-level network vulnerability in the earliest stages of AD.
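For readers unfamiliar with W-scores, the seed-selection threshold mentioned in this abstract can be written out as below. This is a sketch of the conventional W-score definition, assuming the normative model supplies a predicted value and a residual standard deviation from the amyloid-negative reference group; the symbols are illustrative and are not taken from the preprint.

```latex
W_i(v) = \frac{y_i(v) - \hat{y}_i(v)}{\sigma_{\mathrm{ref}}(v)},
\qquad
v \text{ is an atrophy seed for participant } i \iff W_i(v) \le -1.96
```

Here $y_i(v)$ is the observed morphometry value of participant $i$ at voxel $v$, $\hat{y}_i(v)$ is the value the normative model predicts for that participant, and $\sigma_{\mathrm{ref}}(v)$ is the standard deviation of the residuals in the reference group. The cut-off of -1.96 corresponds to the lower 2.5% tail of a standard normal distribution.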
Payne, M.; Tam, K. K.-G.; Rockett, R. J.; Basile, K.; Bowden, R.; Sintchenko, V.; Kok, J.; Golubchik, T.
Targeted metagenomics, where samples are enriched for multiple organisms of interest using oligonucleotide probes, is a highly efficient sequencing methodology that is becoming standard practice for genomics of viruses and complex polymicrobial samples. Efficient enrichment critically requires probes that capture both conserved and highly diverse genomic regions without loss of sensitivity, and with uniform representation in the sequencing pool. Design of optimal probesets poses a challenge: existing computational methods use k-mer hashing to reduce over-abundant sequences, but scalability and efficiency drop with increasing numbers of genomes, while diverse sequences remain under-represented. Here we show that incorporating evolutionary distance to compress probes via a graph-based representation of multiple genomes across species, together with k-mer hashing, reduces overrepresentation of conserved sequences, and yields more uniform coverage even of highly diverse loci. We make the method available in Dampa, an open-source tool that generates probesets in seconds on a standard laptop.
Yang, K.; Shi, P.; Huang, H.; Musio, F.; Baazaoui, H.; Aydin, O. U.; Hilbert, A.; Hamadache, R. E.; Yalcin, C.; Zhang, M.; Falcetta, D.; de la Rosa, E.; Shit, S.; Prabhakar, C.; Wittmann, B.; Rokuss, M. R.; Kirchhoff, Y.; Al-Maskari, R.; Hoeher, L.; Juchler, N.; Casamitjana, A.; Cleary, J.; Schmick, A.; Baumgartner, P.; Deseoe, J.; Vandans, O.; Lee, D.; Oh, K.; LaBella, D.; Mazher, M.; Niederer, S. A.; Qayyum, A.; Liu, Y.; Chen, J.; Kim, W.; Asawalertsak, N.; Kim, M.; Shin, D.; Park, S.-H.; Kikuchi, S.; Zhang, Y.; Liu, J.; Cui, Y.; Qiu, Y.; Verschuur, A.; Zhang, J.; van der Schaaf, I.; Su, R.;
We present the TopBrain 2025 Challenge, the first benchmark for fine-grained multiclass segmentation of the whole brain vasculature in both computed tomography angiography (CTA) and magnetic resonance angiography (MRA). Building on the TopCoW challenge, TopBrain scales vessel annotation from the Circle of Willis to the entire brain, introducing a dataset of 90 annotated volumes across 48 landmark vessel classes spanning arterial and venous systems, of which 50 training volumes are publicly released. Vessel definitions were consolidated from established neuroanatomical references into a unified annotation scheme, and vessel caliber measurements along the centerline are reported for the first time across the whole brain vascular anatomy. To address the unique challenges of multiclass brain vessel segmentation, we propose an evaluation framework that accounts for detection in segmentation performance, assesses anatomical plausibility, and introduces novel contamination metrics that characterize inter-class prediction errors. Fifteen teams from over 220 registered participants submitted algorithms to the benchmark. The top-performing teams built on nnUNet with principled system design choices, achieving around 80% Dice scores, near-zero invalid neighbor counts, over 60% F1 scores for side-road vessels, and below 18% foreground contamination ratio. Larger vessels are easier to segment, while smaller and more complex vessels remain the true bottleneck. The annotated datasets and podium-finish algorithms are made publicly available on Zenodo.
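The Dice scores quoted in this abstract measure overlap between predicted and reference labels. The snippet below is a minimal sketch of a per-class Dice computation on flattened label arrays. The function name and the toy labels are hypothetical, and this is not the challenge's evaluation code, which according to the abstract also covers detection, anatomical plausibility, and contamination metrics.

```python
from typing import Dict, Iterable, Sequence


def per_class_dice(
    truth: Sequence[int], pred: Sequence[int], classes: Iterable[int]
) -> Dict[int, float]:
    """Return the Dice score 2|A∩B| / (|A| + |B|) for each requested class label."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same number of voxels")
    scores: Dict[int, float] = {}
    for c in classes:
        n_truth = sum(1 for t in truth if t == c)
        n_pred = sum(1 for p in pred if p == c)
        n_both = sum(1 for t, p in zip(truth, pred) if t == c and p == c)
        # Dice is undefined when the class is absent from both volumes.
        denom = n_truth + n_pred
        scores[c] = 2.0 * n_both / denom if denom else float("nan")
    return scores


# Toy example: label 0 is background, labels 1 and 2 are two vessel classes.
truth = [0, 1, 1, 2, 2, 2]
pred = [0, 1, 2, 2, 2, 0]
scores = per_class_dice(truth, pred, classes=[1, 2])
```

In the toy example both vessel classes score 2/3: class 1 has one overlapping voxel out of 2 + 1 labelled voxels, and class 2 has two out of 3 + 3.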
Kim, J.; Blalock, N.; Kulkarni, A.; Nakamura, K.; Romero, P. A.
Antibodies originate from germline templates and are diversified by somatic hypermutation, producing sequences in which conserved germline residues scaffold structure while rare non-germline (NGL) substitutions refine antigen binding. Current antibody language models (ALMs) treat all residues equivalently and inherit a germline bias that systematically down-weights functionally critical NGL mutations as statistical noise. We introduce PRISM, a germline-aware ALM that explicitly represents germline and nongermline residues as distinct token types over a factorized 53-token vocabulary. PRISM achieves state-of-the-art pseudo-perplexity in hypervariable CDRs and is uniquely positively correlated with experimental binding affinity across three deep mutational scanning landscapes on which all compared ALMs anti-correlate. The dual-vocabulary further enables property-specific controllable generation previously unattainable with entangled ALMs. NGL-directed sampling improves physics-based binding scores while GL-directed sampling preserves stability and solubility. These results establish disentangled germline/non-germline representation as a substantive advance in antibody language modeling.
Motta, S.; Santini, G.; Mansoor, S.; Nezhad, F. H.; Meli, M.; Pandini, A.
Biomolecular function is often controlled by structural and dynamical adaptations to binding events. Although molecular dynamics (MD) simulations can capture these events at atomic resolution, separating functional signatures from stochastic noise remains challenging. Traditional methods often struggle to isolate mechanistically relevant differences across independent replicas. Here, we introduce an explainable deep learning approach that learns state-specific dynamic signatures directly from MD trajectories. By coupling a dynamic protein graph representation with group-aware contrastive learning across independent replicas, the model detects the signatures, filtering out trajectory-specific correlations. An explainable AI framework then maps the identified differences on individual residues. We demonstrate this approach by identifying "binding-ready" conformations in a T4-Lysozyme mutant, recovering the allosteric determinants of peptide recognition in the PDZ3 domain, and isolating a ligand-independent activation signature for the A2A receptor. Our GISTnet-MD method generalizes across unseen data during comparative MD analysis, translating raw trajectory differences into residue-level determinants of protein function.
Shah, M.
Amyotrophic lateral sclerosis (ALS) is a progressive neurodegenerative disease affecting more than 450,000 individuals worldwide and is frequently diagnosed more than 12 months after symptom onset, delaying intervention during a critical early window. Because up to 80% of patients develop dysarthria within two years, subtle changes in speech provide a signal of early bulbar motor neuron degeneration. However, existing speech-based systems rely on supervised classification trained on limited datasets, achieving moderate sensitivity and depending heavily on labeled disease examples, which restrict scalability and early detection. This study introduces SPEAK-NORM, the first-ever normative speech modeling framework for early ALS diagnosis, which learns age- and sex-conditioned motor-speech distributions exclusively from healthy individuals. A conditional variational autoencoder models coordination of hypoglossal, laryngeal, and respiratory motor pathways, and deviation from this healthy manifold is quantified through latent representations and reconstruction error to form a 354-dimensional profile. A calibrated linear Support Vector Machine performs subject-level classification under subject-disjoint validation. On the VOC-ALS database (n = 153), SPEAK-NORM achieves 98% accuracy with balanced sensitivity and specificity, significantly outperforming established clinical acoustic indices and prior systems. The framework maintains strong performance under cross-task generalization and when retrained on healthy controls in independent dementia and Parkinson disease cohorts, demonstrating disease-specific deviation patterns rather than generic neurodegenerative change. Spectral, temporal, and latent separations further support interpretability. 
By modeling healthy speech instead of memorizing disease examples, SPEAK-NORM enables scalable early neuromotor screening using recording devices, with potential to support earlier diagnosis, differential classification, and monitoring of ALS progression.
Liu, R.; Jong, C.; Li, H.; Cao, Y.; Yao, Q.; Yamana, T.; Pei, S.; Du, H.
Effective pandemic response requires accurate modeling of population compliance with non-pharmaceutical interventions (NPIs), yet most epidemic models treat behavioral change as fixed scenarios rather than an emergent process. Here, we test whether large language model (LLM)-based agents can generate individualized behavioral responses to time-varying NPIs and disease risk. We instantiate demographically representative agents in three U.S. cities (Boston, Denver, San Antonio) and condition them on evolving outbreak conditions and policies during the early COVID-19 pandemic, without fitting to observed mobility data. Across three frontier LLMs and their ensemble, agents generate zero-shot mobility changes across restaurants, retail, and entertainment venues, benchmarked against cellphone-derived foot-traffic records. The simulations recover average mobility trends across cities and venue types but exhibit overly narrow within-city variation. The three LLMs display distinct biases, while an ensemble approach improves robustness and overall performance. These findings establish LLM agents as a promising framework for modeling adherence to NPIs and highlight the need for further fine-tuning and empirical validation before they can support policy analysis.
Alhasani, K. T.; Ghose, U.; Sammet, J.; Zhu, T.; Xiao, S.; Hastoy, B.; Brennan, P.; Froud, K.; Ulm, B.; Duijn, C. v.; Winchester, L. M.; Marsden, B. D.; Nevado-Holgado, A.
Imaging genetics aims to understand how genetic variation influences brain structure and cognitive function. Traditional approaches often rely on imaging-derived phenotypes (IDPs), which require high-dimensional brain images to be reduced to predefined summary measures and may therefore miss subtle or spatially distributed genotype-related effects. We developed a two-stage framework that integrates deep learning and statistical modelling to derive and exploit brain-genotype scores: continuous, image-based representations of genetic variation learned directly from structural MRI. In the first stage, we trained a multi-task 3D convolutional neural network (CNN) on T1-weighted MRI scans from the UK Biobank, a large, population-based cohort, to predict single-nucleotide polymorphism (SNP) variation, producing brain-genotype scores that capture distributed neuroanatomical patterns associated with specific genetic variants. Unlike conventional IDPs, these scores are learned directly from raw images and are designed to encode genotype-related brain structure without reliance on predefined regional features. Gradient-based saliency maps were used to localise neuroanatomical regions contributing to each score, providing interpretable links between genetic variation and brain anatomy. In the second stage, brain-genotype scores derived from the held-out test set were used as quantitative neuroanatomical markers in association analyses with cognitive performance. These scores showed robust, Bonferroni-corrected associations with multiple cognitive measures, including fluid intelligence, reaction time, and memory performance. In contrast, traditional machine learning models trained on IDPs failed to generate comparably informative scores.
This integrated framework demonstrates that brain-genotype scores provide a flexible and interpretable representation of genotype-related neuroanatomical variation, enabling the discovery of biologically meaningful links between genetic variation, brain structure, and cognition that are difficult to detect using traditional imaging genetic approaches.
Lee, S. H.; Wang, S.; Varkanitsa, M.; Kiran, S.
Macrolinguistic discourse analysis offers valuable insight into how patients with neurogenic communication disorders organize and produce informative speech, yet it remains a largely manual and labor-intensive process. We report an automated pipeline for macrolinguistic discourse analysis for individuals with aphasia and dementia that integrates automatic speech recognition (ASR), utterance segmentation, sentence-level embeddings, centroid-based main-concept matching, and rule-based coherence error classification. These algorithms were applied to Cinderella story retellings from 309 participants (113 controls, 102 post-stroke aphasia (PWA), and 94 dementia). The algorithm reliably identified main concepts (83% accuracy against human labels) and derived interpretable features such as semantic distance to a main concept centroid, main concept coverage, and coherence error rates. Crucially, diagnostic classification results showed that logistic-regression classifiers trained on 10 macrolinguistic features distinguished aphasia from controls with high accuracy (AUC ≈ 0.94) but showed weaker separation for dementia (controls vs dementia AUC ≈ 0.66; aphasia vs dementia AUC ≈ 0.58). Semantic distance to the centroid emerged as a robust, informative predictor for diagnostic classification, demonstrating that the ability to produce narrative-aligned speech is clinically important. The automated pipeline enables scalable macrolinguistic discourse analysis that could support screening and longitudinal monitoring of discourse impairments across neurogenic populations.
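The "semantic distance to a main concept centroid" feature can be illustrated with a few lines of code. The sketch below assumes that the centroid is the mean of reference sentence embeddings for one main concept and that distance is cosine distance; the abstract does not state the metric, so both choices, the function names, and the two-dimensional toy vectors are assumptions.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length embeddings."""
    if not vectors:
        raise ValueError("need at least one embedding")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Toy reference embeddings for one main concept (real sentence embeddings
# have hundreds of dimensions).
reference = [[1.0, 0.0], [0.0, 1.0]]
concept_centroid = centroid(reference)  # [0.5, 0.5]

on_topic = cosine_distance([1.0, 1.0], concept_centroid)   # same direction as the centroid
off_topic = cosine_distance([1.0, 0.0], concept_centroid)  # 45 degrees away from the centroid
```

An utterance whose embedding points along the centroid gets a distance near 0, and the distance grows as the utterance drifts away from the reference descriptions of that concept.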
Liu, D.; Yu, Y.; Wu, Y. N.
The success of large language models (LLMs) across diverse NLP tasks has elevated the importance of reasoning chain optimization as a critical step in aligning model behavior with task objectives. Existing reasoning chain tuning methods often rely on black-box heuristics or gradient-free search, which lack interpretability, generalization, and sample efficiency. In this work, we introduce Thoughts-as-Planning, a novel framework that formalizes reasoning chain optimization as a sequential decision-making process over a latent semantic space. We model the LLM as a partially observable environment and learn a latent world model that simulates the effect of reasoning chain edits on downstream outputs. A proximity-preserving embedding space is constructed to encode reasoning chain-response dynamics, enabling planning via gradient descent or reinforcement learning. Our method supports multi-scale abstraction, allowing reasoning chain edits at token, segment, and instruction levels to be integrated into a unified planner. Through extensive experiments on language understanding and generation tasks, we demonstrate that Thoughts-as-Planning outperforms state-of-the-art reasoning chain tuning baselines in efficiency, robustness, and generalization, while offering interpretability through its structured planning trajectory. Our code is available at https://github.com/FastLM/Thoughts-as-Planning.
Zhong, L.; Bleichrodt, A.; Pandey, A.; Kunkel, D.; Rennert, L.
Wastewater-based epidemiology has emerged as a powerful complement to clinical surveillance for monitoring infectious disease dynamics. However, most existing approaches treat wastewater sites in isolation, overlooking spatial dependencies, and often fail to account for variability in data quality, limiting their ability to generate reliable predictions of healthcare demand. Here we present a spatial Bayesian renewal framework that integrates wastewater surveillance with mobility-informed spatial interactions while incorporating reliability-weighted wastewater signals. We apply the framework to three major respiratory pathogens, i.e., SARS-CoV-2, influenza, and respiratory syncytial virus (RSV), using wastewater and hospital data from counties in South Carolina. Across rolling four-week forecasts, the spatial framework consistently outperforms non-spatial approaches and remains robust even in counties lacking direct wastewater or hospitalization observations. Importantly, we show that county-level forecasts can be translated into facility-level predictions, enabling localized assessment of healthcare demand. These forecasts provide actionable early-warning signals to support hospital capacity planning, staffing decisions, and resource allocation. Together, this work establishes a scalable digital surveillance framework that integrates heterogeneous data sources to enable more reliable infectious disease forecasting and to support public health decision-making in underserved and data-limited settings.
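As background for the term "renewal framework", the standard renewal equation sets expected new infections at time t to R_t multiplied by past incidence weighted by the generation-interval distribution. The sketch below shows only that deterministic, single-location core. It is not the preprint's model, which is Bayesian and spatial and uses wastewater signals; the function names and toy numbers are hypothetical.

```python
from typing import List, Sequence


def expected_incidence(
    past: Sequence[float], r_t: float, gen_interval: Sequence[float]
) -> float:
    """Renewal equation: R_t * sum_s w_s * I_{t-s}.

    `past` is ordered oldest to newest; gen_interval[s - 1] is the weight w_s
    for infections that occurred s time steps ago.
    """
    total = 0.0
    for s, w in enumerate(gen_interval, start=1):
        if s <= len(past):
            total += w * past[-s]
    return r_t * total


def project(
    initial: Sequence[float], r_values: Sequence[float], gen_interval: Sequence[float]
) -> List[float]:
    """Extend an incidence series by one step for each reproduction number given."""
    series = list(initial)
    for r_t in r_values:
        series.append(expected_incidence(series, r_t, gen_interval))
    return series


# With all generation-interval mass at lag 1 and R_t = 2, incidence doubles each step.
doubling = project([10.0], [2.0, 2.0, 2.0], gen_interval=[1.0])

# With R_t = 1 and a flat history, expected incidence stays level.
steady = expected_incidence([10.0, 10.0], 1.0, gen_interval=[0.5, 0.5])
```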
Oesinghaus, L.; Park, M.; Shao, R.; Koh, P. W.; Seelig, G.
Cytokine biology is dispersed across hundreds of thousands of publications, making it difficult to use systematically when interpreting new experiments. Large language models (LLMs) can assist with focused literature interpretation, but ad hoc retrieval remains incomplete and unreliable. We present the Cytokine Effect Database (CytED), a framework for interfacing user-supplied experimental datasets with literature knowledge at scale. CytED uses a multi-step LLM pipeline to generate over a million cytokine-cell type-effect triples from 110,000 full-text publications, with annotations for experimental context and directional changes in genes, pathways, and cellular processes. This structure enables quantitative comparison between observed perturbation responses and prior literature across cytokines, cell types, and experimental contexts. Applied to in vitro IL-10 stimulation of PBMCs, CytED identifies unexpected pro-inflammatory features in monocytes and systematic in vivo-in vitro differences in cytotoxicity responses in CD8+ T cells. CytED infers cytokine signaling, distinguishes primary from secondary cytokine effects, and guides the design of combinatorial perturbation screens. Together, CytED establishes a general paradigm for converting unstructured domain literature into analytical tools that bridge literature and experiment.
Kimpson, T.; Flegg, M. B.; Flegg, J. A.
The stochastic simulation algorithm (SSA) is widely used to perform exact forward simulation of discrete stochastic processes in biology. However, the computational cost, driven by sequential event-by-event sampling across large ensembles, remains a barrier. We investigate whether reduced-precision floating-point arithmetic can accelerate SSA without degrading statistical fidelity, drawing on the success of reduced-precision methods in weather and climate modelling. We evaluate two strategies across five canonical models (birth-death, Schlögl, Telegraph, dimerisation, repressilator): (i) mixed precision, computing propensities in 16-bit while maintaining accumulators in 32-bit; and (ii) uniform precision, performing all arithmetic in 16-bit. Mixed-precision SSA produces ensemble statistics that closely match the 64-bit reference for all models, as measured by Kolmogorov-Smirnov tests and Wasserstein distances. Under uniform precision, deterministic rounding introduces systematic biases across several models, with catastrophic failures in some cases. Stochastic rounding (SR) and propensity normalisation eliminate these biases, restoring distributional fidelity across all models tested (KS p > 0.05). Our results establish mixed-precision SSA with SR as a viable acceleration strategy for mathematical biology: 16-bit formats shrink per-variable data size by 2-4x relative to fp32/fp64, yielding comparable reductions in memory footprint and up to ~1.5x wall-clock speedup on CPU hardware that lacks native 16-bit arithmetic. As a hardware-level acceleration, mixed-precision SSA complements algorithmic methods such as tau-leaping and maps naturally onto modern GPU and TPU architectures with native 16-bit arithmetic.
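The contrast between deterministic and stochastic rounding in 16-bit arithmetic can be seen in a toy accumulation. The sketch below is not the authors' SSA code: it assumes IEEE 754 half precision emulated through Python's `struct` format `e`, handles only non-negative values, and uses hypothetical function names. It repeatedly adds a small increment to a 16-bit running total, which is where round-to-nearest is known to stall.

```python
import random
import struct
from typing import Callable


def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value (deterministic)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def _bits(x: float) -> int:
    """Raw 16-bit pattern of x after rounding to half precision."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def _from_bits(bits: int) -> float:
    """Half-precision value encoded by a raw 16-bit pattern."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def stochastic_round_fp16(x: float, rng: random.Random) -> float:
    """Round a non-negative x to one of its two fp16 neighbours.

    The upper neighbour is chosen with probability proportional to how close
    x lies to it, so the rounded value equals x in expectation.
    """
    if x < 0.0:
        raise ValueError("this sketch only handles non-negative values")
    nearest = to_fp16(x)
    if nearest == x:
        return nearest
    # For non-negative fp16 values, adjacent bit patterns are adjacent numbers.
    if nearest < x:
        lo, hi = nearest, _from_bits(_bits(nearest) + 1)
    else:
        lo, hi = _from_bits(_bits(nearest) - 1), nearest
    p_up = (x - lo) / (hi - lo)
    return hi if rng.random() < p_up else lo


def accumulate(step: float, n: int, rounder: Callable[[float], float]) -> float:
    """Add `step` n times, rounding the running total to fp16 after each add."""
    total = 0.0
    for _ in range(n):
        total = rounder(total + step)
    return total


rng = random.Random(0)
nearest_sum = accumulate(0.001, 10_000, to_fp16)
sr_sum = accumulate(0.001, 10_000, lambda v: stochastic_round_fp16(v, rng))
```

The exact total is 10. Under round-to-nearest the running total stops growing once the increment falls below half the spacing between adjacent 16-bit values, so `nearest_sum` ends well short of 10, while `sr_sum` stays close to it because the rounding errors average out.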
Celotto, M.; Sooter, J. S.; Ährlund-Richter, S.; Jenks, K. R.; Sur, M.; Panzeri, S.
Identifying subpopulations of neurons that interact with each other from simultaneous recordings of populations of many neurons is key for understanding across-brain communication with cellular resolution. Recent work identified communication subspaces, which capture additive interactions between pairs of high-dimensional neural populations through a small number of source and target activity patterns. However, no current method captures how a third, potentially multivariate variable - such as behavioral state or the activity of a third population - modulates these interactions. Here we extend the communication subspace framework by parameterizing modulation as a low-rank tensor. This identifies multiplicative interaction channels (MICs), defined as triplets of source, target, and modulator activity patterns, in which the modulator pattern gates the source-target interaction. We derive MICs as a bilinear perturbation of reduced-rank regression. We develop a hierarchical fitting pipeline and provide a closed-form decomposition that quantifies whether modulation reshapes the modulator-averaged baseline interaction, recruits private dimensions of one population, or opens new interactions. In simulations, MICs reliably recover the presence and geometry of ground-truth modulation even in the high-dimensional, low-sample regime. Applying MICs to simultaneous calcium imaging of prefrontal axons and interneurons in the visual cortex revealed that behavioral state asymmetrically modulates top-down interactions, reconfiguring the patterns of prefrontal projections that interact with a stable set of visual interneuron activity patterns. By providing an efficient and compact characterization of modulatory interactions, MICs enable asking new questions about how potentially high-dimensional variables shape interactions between neural populations.
Vindas Yassine, Y. E.; Bornet, A.; Abbas, M.; Geissbuehler, D.; Rodrigues-Jr, J. F.; Teodoro, D.
Transmissible hospital-acquired infections (HAIs) arise from complex, time-varying interactions among patients, healthcare workers, and clinical environments. Although data-driven approaches like graph neural networks (GNNs) effectively model these contacts, they often function as black boxes that overlook established epidemiological principles, limiting interpretability and clinical trust. Inspired by physics-informed neural networks, we propose an epidemiology-informed GNN (EIGNN) framework for patient-level state-transition prediction in dynamic hospital settings, integrating mechanistic epidemiological models into GNNs in a principled manner. Patient-level risk factors learned from dynamic contact networks are jointly leveraged to infer latent epidemiological states, predict state transitions across multiple horizons, and estimate key epidemiological parameters, including transmission and recovery rates. We evaluate the approach on a real-world hospital-onset COVID-19 cohort and two public datasets simulating viral and bacterial HAIs. Across multiple architectures and horizons, EIGNNs achieve AUC-ROC up to 98.46% while providing interpretable, mechanistically consistent insights, offering a transparent tool for infection prevention and control.
Calvanese, F.; Lombardi, G.; Weigt, M.; Fernandez-de-Cossio-Diaz, J.
Protein language models (pLMs) leverage large-scale evolutionary data to generate novel sequences, but steering generation toward desired physicochemical properties without sacrificing diversity remains a major challenge. Existing approaches often induce severe diversity loss or require computationally expensive retraining. We introduce Iterative Lookback Monte Carlo (ILMC), a training-free inference-time sampling strategy that interleaves autoregressive elongation with Metropolis-Hastings refinement to approximate sampling from a maximum-entropy target distribution balancing generative quality and steering objectives. We show theoretically that this target distribution is entropy-maximizing under fixed generative quality and steering constraints, and empirically that ILMC produces more diverse samples than standard autoregressive baselines at matched generative quality. Using simple steering potentials, ILMC improves desired molecular properties, including generating proteins with up to 12 °C higher predicted melting temperature than compute-matched alternative strategies. ILMC naturally applies to classifier-guided steering, where it outperforms purely autoregressive guidance in diversity while maintaining comparable enrichment of target properties. We validate ILMC on family-specific pLMs and on the multi-family model ProGen3.
Huang, X.; Ang, A.; Vasoya, A. P.; Wang, Y.; Teresa, P.
Inferring gene regulation from time-course expression profiles is essential for understanding how cells transition between states during development, differentiation, and disease progression. Existing approaches often model expression dynamics with ordinary differential equations (ODEs). However, due to the computational complexity of directly solving these ODE models, most methods rely on finite-difference approximations of temporal derivatives, which can amplify measurement noise, introduce discretization bias, and lead to unstable or biased parameter estimates. To fill this gap, we develop the first computational method to directly learn a linear ODE model for gene regulation inference without relying on finite-difference approximations. We first formulate an optimization problem that directly exploits the closed-form solution of the linear ODE system. We then solve this problem via gradient descent, deriving analytical gradients with respect to the model parameters; these gradients involve matrix exponentials and integrals, which are challenging to compute directly. To make the computation efficient, we further use high-order Taylor approximations of the gradients whose truncation error is on the order of machine precision. In addition, we establish theoretical results demonstrating an inherent, non-vanishing gap between our exact solution and solutions derived from finite-difference approximations, which underscores the theoretical advantages of our approach. Finally, we demonstrate that our method consistently outperforms competing approaches on both simulated data and real-world scRNA-seq datasets in terms of AUROC. Our source code is available at https://github.com/EJIUB/ExactLinearODE
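To make the phrase "closed-form solution of the linear ODE system" concrete: for a homogeneous system dx/dt = A x, the solution is x(t) = exp(At) x(0). The sketch below assumes that homogeneous form, which the abstract does not spell out, and is unrelated to the code in the linked repository. It evaluates the matrix exponential with a truncated Taylor series plus scaling and squaring, and compares the result with a first-order finite-difference (forward Euler) step on a 2x2 rotation system whose exact solution is known. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def identity(n: int) -> Matrix:
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def mat_exp(a: Matrix, terms: int = 18, squarings: int = 8) -> Matrix:
    """exp(a) via a Taylor series on a / 2**squarings, then repeated squaring.

    The fixed settings assume a matrix of moderate norm, as in this toy example.
    """
    scale = 2.0 ** squarings
    scaled = [[v / scale for v in row] for row in a]
    result = identity(len(a))
    term = identity(len(a))
    for k in range(1, terms + 1):
        # term holds scaled**k / k! after this update.
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def solve_closed_form(a: Matrix, x0: List[float], t: float) -> List[float]:
    """x(t) = exp(a * t) @ x0 for the system dx/dt = a @ x."""
    e = mat_exp([[v * t for v in row] for row in a])
    return [sum(e[i][j] * x0[j] for j in range(len(x0))) for i in range(len(x0))]


def solve_euler(a: Matrix, x0: List[float], t: float, steps: int) -> List[float]:
    """Forward Euler: x_{k+1} = x_k + dt * a @ x_k."""
    dt = t / steps
    x = list(x0)
    for _ in range(steps):
        dx = [sum(a[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]
        x = [xi + dt * di for xi, di in zip(x, dx)]
    return x


# Rotation system: the exact solution from x0 = (1, 0) is (cos t, sin t).
A = [[0.0, -1.0], [1.0, 0.0]]
t_end = math.pi / 2
exact = [math.cos(t_end), math.sin(t_end)]
closed = solve_closed_form(A, [1.0, 0.0], t_end)
euler = solve_euler(A, [1.0, 0.0], t_end, steps=16)

err_closed = max(abs(c - e) for c, e in zip(closed, exact))
err_euler = max(abs(c - e) for c, e in zip(euler, exact))
```

On this example the closed-form evaluation matches the analytic solution to near machine precision, while the 16-step Euler approximation carries a visible discretization error. This is only the forward-simulation counterpart of the bias the abstract attributes to finite-difference derivative estimates.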